Current Issue: July - September | Volume: 2020 | Issue Number: 3 | Articles: 5
This paper presents a new method of identifying Vietnamese voice commands using Google speech recognition (GSR) service results. The problem is that the percentage of correct identifications of Vietnamese voice commands in the Google system is not high. We propose a supervised machine-learning approach to address cases in which Google incorrectly identifies voice commands. First, we build a voice command dataset that includes hypotheses of GSR for each corresponding voice command. Next, we propose a correction system using support vector machine (SVM) and convolutional neural network (CNN) models. The results show that the correction system reduces errors in recognizing Vietnamese voice commands from 35.06% to 7.08% using the SVM model and 5.15% using the CNN model…
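As a rough illustration of the correction step (not the authors' implementation), the SVM variant can be approximated as text classification over GSR hypotheses, mapping each recognized string back to the intended command label. The character n-gram features, the toy commands, and all hyperparameters below are assumptions.

```python
# Hedged sketch: treat correction as text classification over GSR hypotheses.
# The toy data and hyperparameters are assumptions, not the paper's exact setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy examples: (GSR hypothesis text, intended Vietnamese command label).
hypotheses = ["bật đèn phòng khách", "bậc đèn phòng khách", "tắt quạt", "tắc quạt"]
commands   = ["bat_den_phong_khach", "bat_den_phong_khach", "tat_quat", "tat_quat"]

# Character n-grams are robust to the small spelling-level confusions
# that GSR tends to make on short commands.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    LinearSVC(),
)
model.fit(hypotheses, commands)

print(model.predict(["bậc đèn phòng khách"]))  # -> corrected command label
```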
Speech recognition allows a machine to turn a speech signal into text through a process of identification and understanding. Extracting the features, predicting the maximum likelihood, and generating the models of the input speech signal are considered the most important steps in configuring an automatic speech recognition (ASR) system. In this paper, an automatic Arabic speech recognition system was built using MATLAB, and 24 Arabic words with a Consonant-Vowel Consonant-Vowel Consonant-Vowel (CVCVCV) structure were recorded from 19 native Arabic speakers, each speaker uttering each word 3 times (1,368 words in total)…
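The three steps named in the abstract (feature extraction, model generation, and maximum-likelihood prediction) can be sketched as follows. The original system is implemented in MATLAB; this Python sketch with MFCC features and per-word Gaussian mixture models is only an assumed stand-in, and the data layout and parameters are hypothetical.

```python
# Hedged sketch of the three steps: feature extraction, model generation,
# and maximum-likelihood prediction. Libraries and parameters are assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load one recording and return its MFCC frames (one row per frame)."""
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

def train_word_models(recordings_by_word, n_components=4):
    """One Gaussian mixture model per word, trained on that word's recordings."""
    models = {}
    for word, paths in recordings_by_word.items():
        frames = np.vstack([mfcc_features(p) for p in paths])
        models[word] = GaussianMixture(n_components=n_components).fit(frames)
    return models

def recognize(path, models):
    """Maximum-likelihood decision: the word whose model best explains the frames."""
    frames = mfcc_features(path)
    return max(models, key=lambda w: models[w].score(frames))
```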
Generating music with emotion similar to that of an input video is a very relevant issue nowadays. Video content creators and automatic movie directors benefit from keeping their viewers engaged, which can be facilitated by producing novel material that elicits stronger emotions in them. Moreover, there is currently a demand for more empathetic computers to aid humans in applications such as augmenting the perception ability of visually- and/or hearing-impaired people. Current approaches overlook the video's emotional characteristics in the music generation step, only consider static images instead of videos, are unable to generate novel music, and require a high level of human effort and skill. In this study, we propose a novel hybrid deep neural network that uses an Adaptive Neuro-Fuzzy Inference System to predict a video's emotion from its visual features and a deep Long Short-Term Memory Recurrent Neural Network to generate corresponding audio signals with a similar emotional character. The former is able to model emotions appropriately due to its fuzzy properties, and the latter models data with dynamic time properties well due to the availability of the previous hidden-state information. The novelty of our proposed method lies in the extraction of visual emotional features in order to transform them into audio signals with corresponding emotional aspects for users. Quantitative experiments show low mean absolute errors of 0.217 and 0.255 on the Lindsey and DEAP datasets, respectively, and similar global features in the spectrograms. This indicates that our model is able to perform domain transformation between visual and audio features appropriately. Based on the experimental results, our model can effectively generate audio that matches the scene and elicits a similar emotion from the viewer in both datasets, and music generated by our model is also chosen more often (code available online at https://github.com/gcunhase/Emotional-Video-to-Audio-with-ANFIS-DeepRNN)…
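As a minimal sketch of the second stage only, a deep LSTM can map a sequence of per-frame emotion features (as would come from the ANFIS predictor, which is not shown here) to audio feature frames. The layer sizes, the 2-D valence/arousal input, and the spectrogram-frame output are assumptions, not the authors' exact architecture.

```python
# Hedged sketch: deep LSTM mapping per-frame emotion features to audio frames.
# Dimensions and conditioning scheme are assumptions for illustration only.
import torch
import torch.nn as nn

class EmotionToAudioLSTM(nn.Module):
    def __init__(self, emotion_dim=2, hidden_dim=256, audio_dim=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(emotion_dim, hidden_dim, num_layers, batch_first=True)
        self.to_audio = nn.Linear(hidden_dim, audio_dim)  # e.g. one spectrogram frame

    def forward(self, emotion_seq):
        # emotion_seq: (batch, time, emotion_dim), e.g. valence/arousal per frame.
        hidden_states, _ = self.lstm(emotion_seq)
        return self.to_audio(hidden_states)               # (batch, time, audio_dim)

# Usage: generate audio frames for a 100-frame clip with 2-D emotion input.
model = EmotionToAudioLSTM()
audio_frames = model(torch.randn(1, 100, 2))
print(audio_frames.shape)  # torch.Size([1, 100, 128])
```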
Single-channel singing voice separation has been considered a difficult task, as it requires predicting two different audio sources independently from mixed vocal and instrument sounds recorded by a single microphone. We propose a new singing voice separation approach based on the curriculum learning framework, in which learning starts with only easy examples and task difficulty is then gradually increased. In this study, we regard data providing obviously dominant characteristics of a single source as an easy case and the other data as a difficult case. To quantify the dominance property between the two sources, we define a dominance factor that determines a difficulty level according to the relative intensity between the vocal sound and the instrument sound. If a given example provides obviously dominant characteristics of a single source according to this factor, it is regarded as an easy case; otherwise, it belongs to the difficult cases. Early stages of learning focus on easy cases, allowing the model to rapidly learn the overall characteristics of each source. Later stages handle difficult cases, allowing more careful and sophisticated learning. In experiments conducted on three song datasets, the proposed approach demonstrated superior performance compared to conventional approaches…
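A dominance factor of this kind might be sketched as below. The paper defines it from the relative intensity of the vocal and instrument sources; the exact formula and the easy/difficult threshold here are assumptions for illustration.

```python
# Hedged sketch of a dominance factor and the resulting curriculum ordering.
# The formula and threshold are assumptions, not the paper's definition.
import numpy as np

def dominance_factor(vocal, instrument, eps=1e-8):
    """Relative intensity of the two sources in one training segment.

    Returns a value near 1 when one source clearly dominates the mixture
    (an "easy" example) and near 0 when both are comparable ("difficult").
    """
    vocal_energy = np.sum(vocal.astype(np.float64) ** 2)
    inst_energy = np.sum(instrument.astype(np.float64) ** 2)
    return abs(vocal_energy - inst_energy) / (vocal_energy + inst_energy + eps)

def curriculum_order(segments, threshold=0.5):
    """Easy segments (one dominant source) first, difficult ones later."""
    scored = [(dominance_factor(v, i), (v, i)) for v, i in segments]
    easy = [pair for score, pair in scored if score >= threshold]
    hard = [pair for score, pair in scored if score < threshold]
    return easy + hard
```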
This paper relates to the separation of single-channel source signals from a single mixed signal by means of independent component analysis (ICA). The proposed idea lies in a time-frequency representation of the mixed signal and the use of ICA on spectral rows corresponding to different time intervals. In our approach, in order to reconstruct the true sources, we propose the novel idea of grouping statistically independent time-frequency domain (TFD) components of the mixed signal obtained by ICA…
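One possible reading of such a pipeline, sketched under assumptions, is shown below: an STFT of the single-channel mixture, FastICA applied across the time frames of the magnitude spectrogram, and one reconstructed signal per independent component using the mixture phase. The paper's grouping of TFD components into true sources is not reproduced here.

```python
# Hedged sketch: STFT -> FastICA over spectral frames -> per-component signals.
# The grouping step that assembles components into sources is omitted.
import numpy as np
from scipy.signal import stft, istft
from sklearn.decomposition import FastICA

def ica_component_signals(mixture, fs, n_components=2, nperseg=1024):
    # Time-frequency representation of the mixed signal.
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    magnitude, phase = np.abs(Z), np.angle(Z)           # (freq bins, frames)

    # Each time frame's spectrum is one observation; ICA extracts
    # statistically independent spectral components from them.
    ica = FastICA(n_components=n_components, random_state=0)
    activations = ica.fit_transform(magnitude.T)        # (frames, components)
    spectra = ica.mixing_                               # (freq bins, components)

    # Rebuild a magnitude spectrogram per component and invert it with the
    # mixture phase, a common simplification.
    signals = []
    for k in range(n_components):
        mag_k = np.abs(np.outer(spectra[:, k], activations[:, k]))
        _, x_k = istft(mag_k * np.exp(1j * phase), fs=fs, nperseg=nperseg)
        signals.append(x_k)
    return signals
```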